Abstract

In many multi-class classification problems, the misclassification rate is not a relevant error
measure; imbalanced-class problems are a typical example. To overcome this shortcoming,
several methods have been proposed in which the error measure embeds richer information than the
mere misclassification rate. Yet, to the best of our knowledge, none of these methods makes use of
one of the most natural tools in the multi-class setting: the confusion matrix.

Recent results show that using the norm of the confusion matrix as an error measure is of
interest because of the additional information contained in the matrix, especially in the case
of imbalanced classes. In this paper, we show step by step how to obtain a boosting-based method
which minimizes the norm of the confusion matrix. The experimental results indicate that the
proposed method performs better than AdaBoost.MM on imbalanced datasets, while both methods
are equivalent on balanced datasets.

Keywords: Multi-class Learning, Classification, Imbalanced Learning, Boosting, Confusion Matrix 

1 Introduction

Learning from imbalanced data concerns theory and algorithms that process a relevant learning task
whenever data is not uniformly distributed among classes. When facing imbalanced classes, the
classification accuracy is not a fair measure to optimize (Fawcett, 2006). Accuracy can be quite high
in the case of extremely imbalanced data: majority classes are promoted, while minority classes are
not recognized. Such a bias gets stronger in the multi-class setting.

In the binary setting, learning from imbalanced data has been widely studied over the past years,
leading to many algorithms and theoretical results (He & Garcia, 2009). It is mostly achieved by
resampling methods that rebalance the data over classes (for example Estabrooks et al. (2004)),
and/or by cost-sensitive methods (for example Ting (2000)), or with additional assumptions such as
active learning within kernel-based methods (for example Bordes & Bottou (2005)).

Despite the famous words "it is easy to generalize to more than two classes", learning from imbalanced
data within a multi-class or multi-label setting is still an open research problem, which is sometimes
addressed through the study of alternative measures of interest. Most of the time, generalizing the
binary setting to the multi-class setting relies on the usual one-vs-all (or one-vs-one) trade-off. It is
worth noticing that some specific learning tasks have been addressed through the optimization of
relevant measures within the multi-class imbalanced setting, whatever the average accuracy may be.
For example: Chapelle & Chang study some ranking measures; Yue et al. (2007) focus on the
maximization of the Mean Average Precision (MAP) in the multi-label setting; Wang et al. (2012)
address imbalanced feature selection through the maximization of the MAUC, the multi-class extension
of the Area Under the ROC Curve; and Tang et al. (2011) maximize the MAUC to improve classification.
Meanwhile, the correlations between these alternative measures and accuracy have been partly studied
(Cortes & Mohri, 2004), without any theoretical result so far (He & Garcia, 2009).

Furthermore, the confusion matrix is one of the most informative measures a multi-class learning
system can rely on. Among other information, it describes:

• the ways the classifier gets right or wrong on each class,

• and the amount of confusion among (imbalanced) classes.

The sum of the entries of a row of the confusion matrix is equal to 1, independently of the number of
examples of the class corresponding to the row. As such, the confusion matrix is a valuable
tool for overcoming the imbalanced classes problem. Moreover, if we consider the matrix
containing only the non-diagonal elements, then summing over a row of this new matrix indicates
how well the corresponding class is recognized. Surprisingly, as far
as we know, no one has proposed an algorithm that optimizes a metric computed from the confusion
matrix.

In this work, we advocate that minimizing the norm of the confusion matrix is helpful for smoothing 
the accuracy among imbalanced classes, so that minority classes are considered as important as majority 
classes. We thus work on a multi-class learning framework, based on the confusion matrix. As far as we 
know, this work presents the first multi-class learning algorithm that minimizes the norm of the confusion 
matrix. 

Starting from a strong theoretical setting for multi-class classification (Mukherjee & Schapire, 2011),
and building on recent work on the confusion matrix (Morvant et al., 2012; Ralaivola, 2012), the
aim of this paper is to sketch a computationally and theoretically sound classification algorithm
(Section 4) that provably minimizes an upper bound on the norm of the confusion matrix, and thereby
controls the classification error, as shown in Section 3. Being boosting-based, this algorithm greedily
performs a kind of regularization on imbalanced classes, in such a way that poorly represented classes
remain of interest throughout the learning process, independently of any prior misclassification cost.
Section 5 summarizes the experimental results of this algorithm, compared to AdaBoost.MM
(Mukherjee & Schapire, 2011). Section 6 concludes with a discussion of the contributions of this paper
and of future work.



2 General framework and Notations 

The method proposed in this paper uses the confusion matrix as a performance measure in order to build 
a multi-class classifier from a boosting-based process. Before attacking the core of the problem and our 
main contribution, we introduce the different notations used throughout the paper. 

The first part of this section contains the notations on the matrices and some of the tools that will be 
used to transform a confusion matrix towards an error measure. The second part introduces the boosting 
framework used to obtain the method that constitutes the main contribution of this paper. 



2.1 General notations 

Matrices are denoted by bold capital letters such as C; C(l, j), or simply $c_{l,j}$, corresponds to
the entry of the l-th row and the j-th column of C. $\lambda_{\max}(C)$ and $Tr(C)$ correspond
respectively to the largest eigenvalue and the trace of C, while $\|C\|$ is its spectral or operator
norm. $\|C\|$ is defined as the square root of the largest eigenvalue of $C^*C$, where $C^*$ is the
conjugate transpose of C. Let A and B be two matrices; then AB and A • B refer respectively to the
matrix product and the Frobenius inner product of A and B.

The indicator function is denoted by I and, unless stated otherwise, K is the number of classes, m the
number of examples and $m_y$ the number of examples of class y, where $y \in \{1, \dots, K\}$.



2.2 Multi-class boosting framework 

In this paper we make use of the boosting framework for multi-class classification introduced in
Mukherjee & Schapire (2011), and more precisely the one defined for AdaBoost.MM. In this setting the
distribution over the training examples is replaced by a cost matrix. Let $S = \{(x_i, y_i)\}_{i=1}^{m}$
be a training sample, where $x_i \in X$ and $y_i \in \{1, \dots, K\}$. The cost matrix D is constructed
so that, for a given example $(x_i, y_i)$, $\forall l \neq y_i$: $D(i, y_i) \le D(i, l)$, where i is the
row of D corresponding to $(x_i, y_i)$.

In the case of AdaBoost.MM, the cost matrix D at iteration T is defined as follows:

$$D_T(i,l) \stackrel{def}{=} \begin{cases} \exp(f_T(i,l) - f_T(i,y_i)) & \text{if } l \neq y_i \\ -\sum_{j \neq y_i} \exp(f_T(i,j) - f_T(i,y_i)) & \text{otherwise,} \end{cases} \tag{1}$$

where $f_T(i,l)$ is the score function computed as:

$$f_T(i,l) = \sum_{t=1}^{T} \alpha_t I[h_t(i) = l]. \tag{2}$$

When needed, $f_T(i,l)$ will be noted as $f_{T,i,l}$ for readability reasons.

The output hypothesis of AdaBoost.MM is given by the following expression:

$$H(x_i) = \operatorname{argmax}_{l = 1, \dots, K} f_T(i,l). \tag{3}$$
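As a minimal sketch of equations (1) and (2), the following NumPy code builds the score matrix from
the votes of the weak classifiers and derives the AdaBoost.MM cost matrix from it. The function names,
the 0-based class labels and the array layout are assumptions made for illustration; they are not part
of the original framework.

```python
import numpy as np

def score_matrix(predictions, alphas, n_classes):
    """f_T(i, l) = sum_t alpha_t * I[h_t(i) = l]  (equation 2).

    predictions: int array of shape (T, m), predictions[t, i] = h_t(x_i) in {0, ..., K-1}
    alphas:      float array of shape (T,), the weights of the weak classifiers
    """
    n_rounds, m = predictions.shape
    f = np.zeros((m, n_classes))
    for t in range(n_rounds):
        # Each weak classifier adds its weight to the label it votes for.
        f[np.arange(m), predictions[t]] += alphas[t]
    return f

def adaboost_mm_cost_matrix(f, y):
    """Cost matrix of equation (1), built from scores f (m x K) and labels y (m,)."""
    rows = np.arange(f.shape[0])
    D = np.exp(f - f[rows, y][:, None])   # exp(f(i, l) - f(i, y_i)) for every label l
    D[rows, y] = 0.0                      # clear the entry of the correct label
    D[rows, y] = -D.sum(axis=1)           # minus the sum of the costs of the wrong labels
    return D
```

By construction every row of the returned matrix sums to zero, and the only negative entry of a row is
the one of the correct label, so the least cost is always put on the correct label.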



3 Boosting the confusion matrix 

3.1 The confusion matrix as an error measure 

Boosting algorithms, such as the AdaBoost family (AdaBoost, AdaBoost.M1, AdaBoost.MM, ...), are
designed to greedily minimize the empirical error computed on the training sample. Their goal
is therefore to iteratively construct a classifier H which minimizes:

$$\hat{R}(H) = \frac{1}{m} \sum_{i=1}^{m} I(H(x_i) \neq y_i).$$

The loss functions considered in these methods reflect this goal: they take into consideration
only the number of examples misclassified by H, independently of their class. For example, in the
case of AdaBoost, the exponential loss forces the weak learners to focus on the most difficult examples.
However, as mentioned in the introduction, in the case of imbalanced classes, this may not be optimal,
since the weak learner could possibly be focused only on the examples of one class. Take, for example, a
binary classification problem where one of the classes makes up 99% of the training sample. The simple
"majority class" classifier would have an error of at most 1%. Nevertheless its generalization capabilities
would be catastrophic, since it can only recognize one class.
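The 99% example can be reproduced in a few lines. This is only an illustrative sketch: the sample of
1000 labels and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical sample: 990 examples of class 0 and 10 examples of class 1.
y_true = np.array([0] * 990 + [1] * 10)

# "Majority class" classifier: it always predicts class 0.
y_pred = np.zeros_like(y_true)

# Overall misclassification rate, blind to the class of the errors.
error = np.mean(y_pred != y_true)

# Accuracy computed separately on each class.
per_class_accuracy = [float(np.mean(y_pred[y_true == k] == k)) for k in (0, 1)]

print(f"overall error: {error:.3f}")
print(f"per-class accuracy: {per_class_accuracy}")
```

The overall error looks excellent while the minority class is never recognized, which is exactly the
information a per-class measure such as the confusion matrix keeps.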

It would therefore be preferable to have another error measure, and another loss function, which
can take richer information into consideration. In this paper, we propose to use a particular form of the
confusion matrix of a classifier as an error measure. We start by defining the true confusion matrix
and the empirical confusion matrix of a classifier h.

Definition 1. (True confusion matrix) The true confusion matrix $A = (a_{l,j})_{1 \le l,j \le K}$ of a
classifier h over a distribution D is defined as:

$$\forall l, j \in \{1, \dots, K\}, \quad a_{l,j} = P_{(x,y) \sim D}(h(x) = j \mid y = l).$$

Definition 2. (Empirical confusion matrix) For a given classifier h and a sample
$S = \{(x_i, y_i)\}_{i=1}^{m} \sim D$, the empirical confusion matrix $A_S = (\hat{a}_{l,j})_{1 \le l,j \le K}$
of h is defined as:

$$\forall l, j \in \{1, \dots, K\}, \quad \hat{a}_{l,j} = \sum_{i : y_i = l} \frac{1}{m_l} I(h(x_i) = j).$$

One may notice that the entries of a row of the confusion matrix sum up to 1, independently of the
number of examples contained in the corresponding class. The diagonal entries of this matrix correspond
to the correctly classified examples. Since our aim is to use the confusion matrix as an error measure,
we zero these diagonal elements. The following definition gives the general terms of the new confusion
matrices:

Definition 3. For all $h \in H$ we define the empirical and true confusion matrices of h by respectively
$C_S = (\hat{c}_{l,j})_{1 \le l,j \le K}$ and $C = (c_{l,j})_{1 \le l,j \le K}$ such that for all (l, j):

$$\hat{c}_{l,j} = \begin{cases} 0 & \text{if } l = j \\ \sum_{i : y_i = l} \frac{1}{m_l} I(h(x_i) = j) & \text{otherwise,} \end{cases} \tag{4}$$

$$c_{l,j} = \begin{cases} 0 & \text{if } l = j \\ P_{(x,y) \sim D}(h(x) = j \mid y = l) & \text{otherwise.} \end{cases} \tag{5}$$

Let $p = [P(y = 1), \dots, P(y = K)]$ be the vector of class priors; then, using the new definition
of the confusion matrix, it is easy to see that, for a given classifier h:

$$R(h) \stackrel{def}{=} P_{(x,y) \sim D}(h(x) \neq y) = \|pC\|_1, \tag{6}$$

where $\|\cdot\|_1$ is the $\ell_1$-norm. This simple, yet beautiful, result means that it is possible to
retrieve the true error rate of h from its confusion matrix.

In this paper we focus on the operator norm of the confusion matrix, which is given by the square root
of the largest eigenvalue of $C^*C$. Using the result in equation (6) and the equivalence between
norms, we have the following relation between the operator norm and the true risk:

$$R(h) \le \sqrt{K} \|C\|. \tag{7}$$

Both equation (6) and equation (7) imply that minimizing the norm of the confusion matrix can be a
good strategy in order to have a small risk.
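The two relations above can be checked numerically on their empirical counterparts. The sketch below
is an illustration only: the synthetic labels, the noisy predictions and the helper name
`confusion_matrix_zero_diag` are assumptions, and the class priors are replaced by the class
frequencies of the sample.

```python
import numpy as np

def confusion_matrix_zero_diag(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix with its diagonal set to zero."""
    C = np.zeros((n_classes, n_classes))
    for l in range(n_classes):
        mask = (y_true == l)
        if mask.sum() == 0:
            continue  # class absent from the sample: leave the row at zero
        for j in range(n_classes):
            if j != l:
                # Fraction of the examples of class l that are predicted as j.
                C[l, j] = np.mean(y_pred[mask] == j)
    return C

rng = np.random.default_rng(0)
K, m = 4, 2000
y_true = rng.choice(K, size=m, p=[0.7, 0.2, 0.06, 0.04])          # imbalanced labels
y_pred = np.where(rng.random(m) < 0.8, y_true, rng.integers(0, K, m))  # noisy predictions

C = confusion_matrix_zero_diag(y_true, y_pred, K)
p = np.bincount(y_true, minlength=K) / m      # empirical class frequencies
risk = np.mean(y_pred != y_true)              # empirical error rate
l1_norm = np.abs(p @ C).sum()                 # l1-norm of the row vector pC
op_norm = np.linalg.norm(C, 2)                # operator (spectral) norm of C

print(risk, l1_norm, np.sqrt(K) * op_norm)
```

With empirical frequencies in place of the priors, the error rate and the l1-norm of pC coincide, and
both stay below the square root of K times the operator norm.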



3.2 Bounding the confusion matrix 

The bound given in equation (7) involves the operator norm of the true confusion matrix, which is
difficult to use in practice since the underlying distribution D is unknown. In order to overcome this
difficulty, we make use of a particular case of Theorem 1 given in Ralaivola (2012). This particular case
consists in choosing the indicator function I as the loss function. The following corollary is the direct
consequence of this choice applied to the theorem.

Corollary 1. For any $\delta \in (0; 1]$, it holds with probability $1 - \delta$ over a sample
$S \sim D$ that:

$$\|C\| \le \|C_S\| + \sqrt{2K \sum_{k=1}^{K} \frac{1}{m_k} \log \frac{K}{\delta}},$$

where $C_S$ is the empirical confusion matrix computed for a classifier h over the training sample S.

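Instead of minimizing the operator norm of the empirical confusion matrix $C_S$, we propose to minimize
an upper bound of $\|C_S\|$:

$$\|C_S\| = \sqrt{\lambda_{\max}(C_S^* C_S)} \le \sqrt{Tr(C_S^* C_S)}.$$

The matrix $C_S^* C_S$ is positive semi-definite, hence all its eigenvalues are non-negative. The equality
is simply a rewriting of the operator norm of $C_S$, while the inequality comes from the fact that the
trace is equal to the sum of the eigenvalues. We now focus on the value of $Tr(C_S^* C_S)$, writing
$\hat{c}_{l,j}$ for the entries of $C_S$:

$$\begin{aligned}
Tr(C_S^* C_S) &= \sum_{l=1}^{K} (C_S^* C_S)(l,l) \\
&= \sum_{l=1}^{K} \sum_{j=1}^{K} \hat{c}_{l,j}^{\,2} \\
&= \sum_{l=1}^{K} \sum_{j \neq l} \hat{c}_{l,j}^{\,2} \\
&\le \sum_{l=1}^{K} \sum_{j \neq l} \hat{c}_{l,j} \\
&= \sum_{l=1}^{K} \sum_{j \neq l} \frac{1}{m_l} \sum_{i : y_i = l} I(h(x_i) = j).
\end{aligned}$$

The first and the second equality come from the definition of the trace and the confusion matrix, while
the third equality comes from the fact that the diagonal entries of the confusion matrix $C_S$ are 0.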

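As for the inequality, it is simply the consequence of the fact that all the entries of the confusion matrix
are smaller than 1. Finally, the last equality is obtained using the definition of the entries of $C_S$.

In the next step, we make use of the boosting framework given in Mukherjee & Schapire (2011), and
more precisely of the definitions of $f_T$ and H. For readability reasons, $f_T(i,j) - f_T(i,y_i)$ is
noted as $\Delta f_T(i,j,y_i)$.

$$\begin{aligned}
Tr(C_S^* C_S) &\le \sum_{l=1}^{K} \sum_{j \neq l} \frac{1}{m_l} \sum_{i : y_i = l} I[H(i) = j] \\
&= \sum_{i=1}^{m} \frac{1}{m_{y_i}} \sum_{j \neq y_i} I[H(i) = j] \\
&\le \sum_{i=1}^{m} \frac{1}{m_{y_i}} \sum_{j \neq y_i} \exp(\Delta f_T(i,j,y_i)).
\end{aligned}$$

The first line restates the bound on the trace for the classifier H, and the equality regroups the sum
example by example. To obtain the last inequality, note that the term $\sum_{j \neq y_i} I[H(i) = j]$ is
non-zero (and equal to 1) only for those examples on which the classifier H errs. For such an example
there exists at least one $j \neq y_i$ such that $f_T(i,j) \ge f_T(i,y_i)$, hence the term
$\sum_{j \neq y_i} \exp(f_T(i,j) - f_T(i,y_i))$ is at least equal to 1. For a correctly classified example
the indicator term is 0 while the exponential term is positive, so the inequality holds for every example.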

Taking a step back and looking at what we have obtained so far gives:

$$\|C_S\| \le \sqrt{Tr(C_S^* C_S)} \quad \text{and} \quad Tr(C_S^* C_S) \le \sum_{i=1}^{m} L_T(i), \tag{8}$$

where $L_T(i) = \frac{1}{m_{y_i}} \sum_{j \neq y_i} \exp(f_{T,i,j} - f_{T,i,y_i})$ is the loss computed for
the example $(x_i, y_i)$. The final step consists in putting together these results.

The different losses $L_T(\cdot)$ on the examples of the training sample are computed after round T.
Since all the parameters, such as $\alpha_t$ and $h_t$ ($\forall t$), are known, we can redefine the score
function $f_T$ as follows:

$$f_T^{norm}(i,l) = \frac{\sum_{t=1}^{T} \alpha_t I[h_t(i) = l]}{\sum_{t=1}^{T} \alpha_t},$$

which is simply a normalized version of the original score function given in equation (2). The second
result obtained in equation (8) is still correct, since the new score function does not change the
predictions returned by H.

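The term $\sum_{i=1}^{m} L_T(i)$ takes its minimal value when all the classifiers $h_t$ correctly classify
all the examples of S, that is $\forall i$, $f_T(i,y_i) = \sum_{t=1}^{T} \alpha_t$ or, replacing $f_T$ by
$f_T^{norm}$, $f_T^{norm}(i,y_i) = 1$. Rewriting the loss with $f_T^{norm}$, and denoting by $S_k$ the set
of examples of class k, we have:

$$\begin{aligned}
\sum_{i=1}^{m} L_T(i) &= \sum_{i=1}^{m} \frac{1}{m_{y_i}} \sum_{j \neq y_i} \exp(f^{norm}_{T,i,j} - f^{norm}_{T,i,y_i}) \\
&= \sum_{i \in S_1} \frac{1}{m_1} \sum_{j \neq 1} \exp(f^{norm}_{T,i,j} - f^{norm}_{T,i,1}) + \dots + \sum_{i \in S_K} \frac{1}{m_K} \sum_{j \neq K} \exp(f^{norm}_{T,i,j} - f^{norm}_{T,i,K}) \\
&\ge \frac{K(K-1)}{e}.
\end{aligned}$$

The last line holds because every normalized score lies in [0, 1], so each exponential is at least
$e^{-1}$, and the sum over class k contains $m_k (K-1)$ such terms, each weighted by $1/m_k$.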

This result shows that if K > 2 then the loss $\sum_{i=1}^{m} L_T(i)$ after iteration T is strictly
greater than 1. Since we are in a multi-class setting, K > 2 is not too much of a limitation. The direct
consequence of this result is the following inequality:

$$\sqrt{Tr(C_S^* C_S)} \le \sqrt{\sum_{i=1}^{m} L_T(i)} \le \sum_{i=1}^{m} L_T(i). \tag{9}$$

Combining equation (9) and the first result of equation (8), we finally obtain:

$$\|C_S\| \le \sum_{i=1}^{m} L_T(i). \tag{10}$$
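The chain of inequalities leading to equation (10) can be checked numerically. The sketch below is an
illustration under stated assumptions: the class sizes are hypothetical, the normalized scores are drawn
at random (each row is non-negative and sums to 1, as normalized votes would), and the helper names
are not from the paper.

```python
import numpy as np

def per_example_loss(f, y):
    """L_T(i) = (1 / m_{y_i}) * sum_{j != y_i} exp(f(i, j) - f(i, y_i))."""
    m, K = f.shape
    rows = np.arange(m)
    m_y = np.bincount(y, minlength=K)[y]       # size of the class of each example
    e = np.exp(f - f[rows, y][:, None])
    e[rows, y] = 0.0                           # drop the term j = y_i
    return e.sum(axis=1) / m_y

def empirical_confusion(y, pred, K):
    """Row-normalized empirical confusion matrix with a zero diagonal."""
    C = np.zeros((K, K))
    for l in range(K):
        for j in range(K):
            if j != l:
                C[l, j] = np.mean(pred[y == l] == j)
    return C

K = 3
y = np.repeat(np.arange(K), [240, 45, 15])     # hypothetical imbalanced sample
rng = np.random.default_rng(0)
f_norm = rng.dirichlet(np.ones(K), size=len(y))  # stand-in for normalized scores
H = f_norm.argmax(axis=1)                      # predictions of the vote

C_S = empirical_confusion(y, H, K)
op_norm = np.linalg.norm(C_S, 2)
trace = np.trace(C_S.T @ C_S)
total_loss = per_example_loss(f_norm, y).sum()

print(op_norm, np.sqrt(trace), total_loss, K * (K - 1) / np.e)
```

On such data the printed values are ordered as the derivation predicts: the operator norm is below the
square root of the trace, which is below the total loss, and the total loss is at least K(K-1)/e.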



3.3 Choosing the confusion matrix 

The result obtained in equation (10) gives an upper bound on the norm of the confusion matrix, which is
the general loss $L_T = \sum_{i=1}^{m} L_T(i)$. Taking a closer look at this loss, one may notice that it is
merely the sum of simpler loss functions, each one defined on an example of the training set S, that is
$L_T(i) = \sum_{l \neq y_i} \frac{1}{m_{y_i}} \exp(f_T(i,l) - f_T(i,y_i))$.

In order to use the boosting framework presented in Mukherjee & Schapire (2011) we need to define a
cost matrix D so that $\forall i$ and $\forall l \neq y_i$, $D(i,y_i) \le D(i,l)$. Moreover D should be
such that the total loss computed on D is equal to $L_T = \sum_{i=1}^{m} L_T(i)$.

Taking into account the fact that the individual losses are fairly similar to the one used in
AdaBoost.MM, the most straightforward choice for D is the following:

$$D_T(i,l) \stackrel{def}{=} \begin{cases} \frac{1}{m_{y_i}} \exp(f_{T,i,l} - f_{T,i,y_i}) & \text{if } l \neq y_i \\ -\frac{1}{m_{y_i}} \sum_{j \neq y_i} \exp(f_{T,i,j} - f_{T,i,y_i}) & \text{otherwise.} \end{cases} \tag{11}$$
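A minimal NumPy sketch of the cost matrix of equation (11) follows. The function name, the 0-based
labels and the toy input are assumptions made for illustration.

```python
import numpy as np

def combo_cost_matrix(f, y):
    """Cost matrix of equation (11): exponential costs weighted by 1 / m_{y_i}.

    f: float array (m, K) of scores, y: int array (m,) of labels in {0, ..., K-1}
    """
    m, K = f.shape
    rows = np.arange(m)
    inv_m_y = 1.0 / np.bincount(y, minlength=K)[y]   # 1 / m_{y_i} for each example
    D = np.exp(f - f[rows, y][:, None]) * inv_m_y[:, None]
    D[rows, y] = 0.0
    D[rows, y] = -D.sum(axis=1)                      # entry of the correct label
    return D

# Toy input: zero scores, three examples of class 0 and a single example of class 1.
f0 = np.zeros((4, 3))
y0 = np.array([0, 0, 0, 1])
D0 = combo_cost_matrix(f0, y0)
print(D0)
```

The only difference with the AdaBoost.MM sketch of equation (1) is the factor 1 / m_{y_i}: the single
example of the rare class carries as much total cost as the three examples of the frequent class.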



4 The core of CoMBo 

In this section, we introduce a new boosting method based on the results obtained in the previous sections 
and we show that the loss decreases after every step of the algorithm, similar to AdaBoost.MM. 



4.1 The Confusion Matrix Boosting Algorithm 

The pseudo-code of the proposed method is given in Algorithm 1. The inputs of this algorithm are the
classical inputs of boosting methods of the AdaBoost family, that is, a training sample S, the
total number of iterations T and a weak learner WL. During the initialization step, the score functions f
are set to zero and the cost matrix D is initialized accordingly.

The training phase consists of two steps: using the weak learner WL to build the set of weak
classifiers, and using the predictions of $h_t$ to update the cost matrix $D_t$. At each round t, WL takes
as input the cost matrix $D_t$ and returns a weak classifier $h_t$. The cost matrix is then used to compute
the weight $\alpha_t$ of $h_t$, which can be seen as the importance given to $h_t$. $\alpha_t$ depends on
the edge $\delta_t$ obtained by $h_t$ over the cost matrix $D_t$. The underlying idea is that the better
$h_t$ performs over $D_t$, the greater the edge $\delta_t$ and the importance given to $h_t$.

The update rule for the cost matrix is designed so that the misclassification cost is increased for the
examples misclassified by $h_t$ and decreased for the correctly classified ones. This forces the weak
learner WL to focus on the most difficult examples. The main difference between our method and
AdaBoost.MM is the use of the term $1/m_{y_i}$ in the update rule, where $m_{y_i}$ is the number of
examples having the same class $y_i$. The direct consequence is that the misclassification cost of an
example $(x_i, y_i)$ also depends on the number of examples of S having the same class $y_i$.

The output hypothesis is a simple weighted majority vote over the whole set of weak classifiers. So,
for a given example, the predicted class is the one that obtains the highest score.



4.2 Bounding the loss 



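First, we recall the minimal weak learning condition as given in Mukherjee & Schapire (2011).

Definition 4. (Minimal weak learning condition) Let $D^{eor}$ be the space of all cost matrices D which
put the least cost on the correct label, that is $\forall (x_i, y_i), \forall l$, $D(i,y_i) \le D(i,l)$.
Let $B_\gamma^{eor}$ be the space of baselines B which are $\gamma$ more likely to predict the correct
label for every example $(x_i, y_i)$, i.e. $\forall l \neq y_i$, $B(i,y_i) \ge B(i,l) + \gamma$. Then, the
minimal weak learning condition is given by:

$$\forall D \in D^{eor}, \exists h \in H : D \cdot 1_h \le \max_{B \in B_\gamma^{eor}} D \cdot B, \tag{12}$$

where H is a classifier space, and $1_h$ is the prediction matrix defined as $1_h(i,l) = I[h(i) = l]$.

Algorithm 1 CoMBo: Confusion Matrix BOosting

Given

• $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ where $x_i \in X$, $y_i \in \{1, \dots, K\}$

• T the number of iterations, WL a weak learner

• $\forall i \in \{1, \dots, m\}, \forall l \in \{1, \dots, K\}$: $f_1(i,l) = 0$, and $D_1$ is initialized
  accordingly, that is $D_1(i,l) = \frac{1}{m_{y_i}}$ if $l \neq y_i$ and $D_1(i,y_i) = -\frac{K-1}{m_{y_i}}$

for t = 1 to T do

    Get $h_t$ with edge $\delta_t$ on $D_t$, where:

    $$\delta_t = \frac{-\sum_{i=1}^{m} D_t(i, h_t(x_i))}{\sum_{i=1}^{m} \sum_{l \neq y_i} D_t(i,l)}$$

    Compute $\alpha_t$ as:

    $$\alpha_t = \frac{1}{2} \ln \frac{1 + \delta_t}{1 - \delta_t}$$

    Update D:

    $$D_{t+1}(i,l) = \begin{cases} \frac{1}{m_{y_i}} \exp(f_{t+1}(i,l) - f_{t+1}(i,y_i)) & \text{if } l \neq y_i \\ -\frac{1}{m_{y_i}} \sum_{j \neq y_i} \exp(f_{t+1,i,j} - f_{t+1,i,y_i}) & \text{if } l = y_i \end{cases}$$

    where $f_{t+1}(i,l) = \sum_{z=1}^{t} I[h_z(i) = l] \alpha_z$

end for

Output final hypothesis:

$$H(x) = \operatorname{argmax}_{l \in \{1, \dots, K\}} f_T(x,l), \quad \text{where } f_T(x,l) = \sum_{t=1}^{T} I[h_t(x) = l] \alpha_t$$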
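The loop of Algorithm 1 can be sketched in NumPy as follows. This is an illustration under several
assumptions, not the implementation used in the experiments: the weak learner is a hypothetical
one-feature decision stump chosen by exhaustive search, labels are 0-based, the edge is clipped just
below 1 so that the weight stays finite, and all function names are made up for the example.

```python
import numpy as np

def cost_matrix(f, y, m_y):
    """Equation (11): exponential costs weighted by 1 / m_{y_i}."""
    rows = np.arange(len(y))
    D = np.exp(f - f[rows, y][:, None]) / m_y[:, None]
    D[rows, y] = 0.0
    D[rows, y] = -D.sum(axis=1)
    return D

def fit_stump(X, D):
    """Hypothetical weak learner: one threshold on one feature, with the label of
    each side chosen to minimise the total cost sum_i D(i, h(x_i))."""
    best = None
    for d in range(X.shape[1]):
        for thr in np.unique(X[:, d]):
            left = X[:, d] <= thr
            left_label = int(D[left].sum(axis=0).argmin())
            right_label = int(D[~left].sum(axis=0).argmin())
            cost = D[left, left_label].sum() + D[~left, right_label].sum()
            if best is None or cost < best[0]:
                best = (cost, d, thr, left_label, right_label)
    return best[1:]

def stump_predict(stump, X):
    d, thr, left_label, right_label = stump
    return np.where(X[:, d] <= thr, left_label, right_label)

def combo_scores(ensemble, X, n_classes):
    """Weighted votes f_T(x, l) of the weak classifiers."""
    f = np.zeros((X.shape[0], n_classes))
    rows = np.arange(X.shape[0])
    for alpha, stump in ensemble:
        f[rows, stump_predict(stump, X)] += alpha
    return f

def combo_fit(X, y, n_classes, n_rounds):
    rows = np.arange(len(y))
    m_y = np.bincount(y, minlength=n_classes)[y].astype(float)
    f = np.zeros((len(y), n_classes))
    ensemble, edges = [], []
    for _ in range(n_rounds):
        D = cost_matrix(f, y, m_y)
        total = -D[rows, y].sum()            # sum of the costs of the wrong labels
        if total <= 0:
            break
        stump = fit_stump(X, D)
        pred = stump_predict(stump, X)
        delta = -D[rows, pred].sum() / total  # edge of the weak classifier on D
        if delta <= 0:
            break                             # no edge: stop boosting
        delta = min(delta, 1 - 1e-10)         # assumption: keep alpha finite
        alpha = 0.5 * np.log((1 + delta) / (1 - delta))
        f[rows, pred] += alpha
        ensemble.append((alpha, stump))
        edges.append(delta)
    return ensemble, edges

# Hypothetical 1-D data: three separated clusters of very different sizes.
X = np.concatenate([np.linspace(0, 1, 60), np.linspace(2, 3, 10), np.linspace(4, 5, 5)])[:, None]
y = np.repeat([0, 1, 2], [60, 10, 5])
ensemble, edges = combo_fit(X, y, n_classes=3, n_rounds=20)
pred = combo_scores(ensemble, X, 3).argmax(axis=1)
```

The stump is only one possible weak learner; any learner that returns a classifier with a positive edge
on the current cost matrix could be plugged in instead.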



In the rest of this paper, we will consider a particular case of baselines, namely those closest to
uniform. These baselines, noted $U_\gamma$, have weights $(1 - \gamma)/K$ on incorrect labels and
$(1 - \gamma)/K + \gamma$ on the correct ones. The weak learning condition is then given by:

$$D \cdot 1_h \le D \cdot U_\gamma. \tag{13}$$

All of the weak classifiers returned by WL during the training phase satisfy this weak learning condition.

The following result shows that the general loss Lt decreases with each iteration, if the weak classifier 
ht satisfies the weak learning condition. This result and its proof are fairly similar to the ones given for 
AdaBoost.MM. 

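Lemma 1. Suppose the cost matrix $D_t$ is chosen as in Algorithm 1 and the returned classifier $h_t$
satisfies the edge condition for the baseline $U_{\delta_t}$ and cost matrix $D_t$, i.e.
$D_t \cdot 1_{h_t} \le D_t \cdot U_{\delta_t}$.

Then choosing a weight $\alpha_t > 0$ for $h_t$ makes the loss
$L_t = \sum_{i=1}^{m} \frac{1}{m_{y_i}} \sum_{l \neq y_i} \exp(f_t(i,l) - f_t(i,y_i))$ at most a factor

$$1 - \frac{1}{2}(e^{\alpha_t} - e^{-\alpha_t})\delta_t + \frac{1}{2}(e^{\alpha_t} + e^{-\alpha_t} - 2)$$

of the loss before choosing $\alpha_t$, where $\delta_t$ is the edge of $h_t$.

Proof. Recall that the loss functions $L_t$ and $L_t(i)$ are defined as

$$L_t = \sum_{i=1}^{m} L_t(i) = \sum_{i=1}^{m} \sum_{l \neq y_i} \frac{1}{m_{y_i}} \exp(f_{t,i,l} - f_{t,i,y_i}).$$

The weak classifier $h_t$ returned by WL satisfies the edge condition, that is:

$$D_t \cdot 1_{h_t} \le D_t \cdot U_{\delta_t}, \tag{14}$$

with $\delta_t$ being the edge of $h_t$ on $D_t$.

Denote by $S_+$ (resp. $S_-$) the set of examples of S correctly classified (resp. misclassified) by $h_t$.
Using the definitions of $D_t$, $1_{h_t}$ and $U_{\delta_t}$, the classification cost of $h_t$ (left side
of (14)) is given by:

$$D_t \cdot 1_{h_t} = -\sum_{i \in S_+} L_{t-1}(i) + \sum_{i \in S_-} \frac{1}{m_{y_i}} \exp(f_{t-1,i,h_t(i)} - f_{t-1,i,y_i}) = -A_+^t + A_-^t,$$

while the cost of $U_{\delta_t}$ (right side of (14)) is given by:

$$D_t \cdot U_{\delta_t} = -\delta_t \sum_{i=1}^{m} L_{t-1}(i) = -\delta_t L_{t-1}.$$

Injecting these two costs in (14), we have:

$$A_+^t - A_-^t \ge \delta_t L_{t-1}. \tag{15}$$

If we take a closer look at the drop of the loss after choosing $h_t$ and its weight $\alpha_t$, we have:

$$\begin{aligned}
L_{t-1} - L_t &= \sum_{i \in S_+} L_{t-1}(i)(1 - e^{-\alpha_t}) + \sum_{i \in S_-} \frac{1}{m_{y_i}} \exp(\Delta f_{t-1}(i,h_t,y_i))(1 - e^{\alpha_t}) \\
&= (1 - e^{-\alpha_t}) A_+^t - (e^{\alpha_t} - 1) A_-^t \\
&= \left(\frac{e^{\alpha_t} - e^{-\alpha_t}}{2}\right)(A_+^t - A_-^t) - \left(\frac{e^{\alpha_t} + e^{-\alpha_t} - 2}{2}\right)(A_+^t + A_-^t),
\end{aligned}$$

where $\Delta f_{t-1}(i,h_t,y_i) = f_{t-1}(i,h_t(i)) - f_{t-1}(i,y_i)$.

The result in (15) gives a lower bound for $A_+^t - A_-^t$, while $A_+^t + A_-^t$ is upper-bounded by
$L_{t-1}$. Hence,

$$L_{t-1} - L_t \ge \left(\frac{e^{\alpha_t} - e^{-\alpha_t}}{2}\right)\delta_t L_{t-1} - \left(\frac{e^{\alpha_t} + e^{-\alpha_t} - 2}{2}\right) L_{t-1}.$$

Therefore, the result of the lemma:

$$\begin{aligned}
L_t &\le \left(1 - \left(\frac{e^{\alpha_t} - e^{-\alpha_t}}{2}\right)\delta_t + \frac{e^{\alpha_t} + e^{-\alpha_t} - 2}{2}\right) L_{t-1} \\
&= \left(\frac{1}{2}\left((1 - \delta_t)e^{\alpha_t} + (1 + \delta_t)e^{-\alpha_t}\right)\right) L_{t-1}.
\end{aligned} \tag{16}$$

□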

The expression of the loss drop given in Lemma 1 can be further simplified. Indeed, if we choose
the value of $\alpha_t$ as given in the pseudo-code of Algorithm 1, then the factor in equation (16) is
simply equal to $\sqrt{1 - \delta_t^2}$. Since the value of $\delta_t$ is always positive,
$\sqrt{1 - \delta_t^2}$ is smaller than 1, thus the loss $L_t$ is always smaller than $L_{t-1}$. The
following theorem summarizes this result.
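The simplification can be checked numerically. The sketch below assumes the weight
$\alpha_t = \frac{1}{2} \ln \frac{1 + \delta_t}{1 - \delta_t}$, which is the value that minimizes the
factor of equation (16); the edges and the function names are hypothetical. It also multiplies the
per-round factors, starting from the initial loss K(K-1) obtained with zero scores.

```python
import math

def loss_factor(alpha, delta):
    """Per-round factor of equation (16): 0.5 * ((1 - d) e^a + (1 + d) e^-a)."""
    return 0.5 * ((1 - delta) * math.exp(alpha) + (1 + delta) * math.exp(-alpha))

def best_alpha(delta):
    """Weight minimizing loss_factor for an edge 0 < delta < 1."""
    return 0.5 * math.log((1 + delta) / (1 - delta))

def product_bound(edges, n_classes):
    """K(K-1) * prod_t sqrt(1 - delta_t^2): bound on the loss after all rounds."""
    bound = n_classes * (n_classes - 1)
    for d in edges:
        bound *= math.sqrt(1 - d * d)
    return bound

edges = [0.3, 0.2, 0.25, 0.1]   # hypothetical edges of four weak classifiers
for d in edges:
    a = best_alpha(d)
    print(d, round(a, 4), round(loss_factor(a, d), 4), round(math.sqrt(1 - d * d), 4))
print(product_bound(edges, n_classes=3))
```

Since $1 - x \le e^{-x}$, the product of the factors decays at least as fast as
$\exp(-\frac{1}{2} \sum_t \delta_t^2)$, so larger edges translate directly into a faster decrease of the loss.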

Theorem 1. Let $\delta_1, \dots, \delta_T$ be the edges of the classifiers $h_1, \dots, h_T$ returned by WL
at each round of the learning phase. Then the error after T rounds is at most
$K(K-1) \prod_{t=1}^{T} \sqrt{1 - \delta_t^2} \le K(K-1) \exp\{-(1/2) \sum_{t=1}^{T} \delta_t^2\}$.

Moreover, if there exists a $\gamma$ so that $\forall t$, $\delta_t \ge \gamma$, then the error after T
rounds is exponentially small, $K(K-1) e^{-T\gamma^2/2}$.

dataset       ||C|| AdaBoost.MM   ||C|| CoMBo    error AdaBoost.MM   error CoMBo
Abalone       [illegible]         [illegible]    [illegible]         [illegible]
Car           [illegible]         [illegible]    [illegible]         [illegible]
Connect-4     [illegible]         [illegible]    [illegible]         [illegible]
Nursery       0.923               0.201          0.007               0.012
Poker         1.778               0.471          0.033               0.050
Pendigits     0.004               0.011          0.000               0.000
Im. Segm.     0.138               0.257          0.004               0.006
Letter        0.187               0.194          0.013               0.013

Table 1: AdaBoost.MM vs. CoMBo on error and norm of the confusion matrix. The last three datasets
are balanced.

5 Experimental results 

5.1 Datasets and experimental setup 

Eight datasets were used in the experiments, the same as in Mukherjee & Schapire (2011). They are all
from the UCI Machine Learning Repository (Frank & Asuncion, 2010), and are all related to multi-class
learning tasks, mainly classification. They exhibit various degrees of imbalance, as well as various
numbers of instances and attributes. Since this work is concerned with multi-class imbalanced datasets,
class distributions must be specified:

• The datasets Pendigits (10 classes), Letter (26 classes) and Image segmentation (7 classes) are
balanced.

• Abalone is fairly imbalanced. It features 28 classes: 13 classes each represent less than 1% of the
total number of instances, while 4 classes each represent more than 10% of the total number of
instances. Each of the 11 remaining classes represents between 1% and 10% of the dataset.

• The dataset Car features 4 classes, which contain respectively 70.023%, 22.222%, 3.993%, and
3.762% of the instances: the last two classes are much less represented than the first two.

• The dataset Connect-4 features 3 classes, which contain respectively 65.83%, 24.62%, and 9.55%
of the instances.

• The dataset Nursery has 5 classes, which contain respectively 33.333%, 0.015%, 2.531%, 32.917%,
and 31.204% of the population.

• The dataset PokerHand is the most imbalanced. It features 10 classes, where the first four contain
respectively around 50%, 42%, 5% and 2% of the dataset. Each of the six other classes represents
less than 0.5% of the dataset.

For each dataset, we performed 10-fold cross-validation and averaged the results. Two measures are
reported: estimates of the error and of the confusion matrix norm. CoMBo and AdaBoost.MM ran for
200 iterations.



5.2 Results 

Results are presented in Table 1: the estimated confusion matrix norms are reported, together with the
estimated generalization errors.

The results on balanced datasets (Letter, Pendigits and Image segmentation) are similar for
AdaBoost.MM and CoMBo: estimated errors and norms of the confusion matrix are very close. These
preliminary results suggest that, in the case of multi-class balanced datasets, there is no gain in using
CoMBo instead of AdaBoost.MM, but there is no loss either (the computational times are about the same).

Concerning imbalanced datasets, CoMBo turns out to be a challenger to AdaBoost.MM. The estimated
real error with CoMBo tends to be slightly worse than that of AdaBoost.MM. Meanwhile, as expected, the
estimated norm of the confusion matrix is much smaller with CoMBo. Looking more closely at the results
on the imbalanced PokerHand dataset, the decrease of the norm is drastic (from 1.778 to 0.471). It
confirms that, with CoMBo, the accuracy of the classifier on minority classes is improved, whereas it
degrades on majority classes. In a sense, the performance of the classifier is smoothed across all the
classes, whatever their representation within the dataset. That way, majority classes are not as favored
as they usually are in multi-class approaches.

AdaBoost.MM on Car:
    0.000  0.006  0.000  0.001
    0.000  0.000  0.000  0.015
    0.000  0.167  0.000  0.000
    0.000  0.250  0.013  0.000

CoMBo on Car:
    0.000  0.045  0.004  0.007
    0.000  0.000  0.021  0.010
    0.000  0.000  0.000  0.013
    0.000  0.000  0.000  0.000

AdaBoost.MM on Connect-4:
    0.000  0.000  0.048
    0.890  0.000  0.110
    0.582  0.000  0.000

CoMBo on Connect-4:
    0.000  0.232  0.137
    0.181  0.000  0.212
    0.079  0.266  0.000

Table 2: Confusion matrices obtained with AdaBoost.MM and CoMBo, on datasets Car and Connect-4.

Let us illustrate with Table 2 what actually occurs on the two smallest imbalanced datasets, where
the norm of the confusion matrix decreases with CoMBo: Car and Connect-4. Except for the diagonal
entries^1, each entry (l, j) represents the probability that instances of class l are classified as j by
the classifier. Hence, the accuracy on class l is estimated as $1 - \sum_{j \neq l} C(l,j)$.

With CoMBo on the Car dataset, the errors on minority classes 3 and 4 get smaller, while the error
on the first (majority) class increases w.r.t. AdaBoost.MM. This is the smoothing effect of CoMBo.

On the Connect-4 dataset, the misclassification rates on classes 2 (100%) and 3 (58.2%) are
dramatically high with AdaBoost.MM, which favors the majority class. With CoMBo, classes 2 and 3 are as
well recognized as class 1, although class 1 is the majority class.
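The per-class accuracies discussed above follow directly from the Connect-4 matrices reported in
Table 2. The short NumPy sketch below recomputes them, together with the operator norm of each
matrix; the helper names are made up for the example, and the norms computed from these averaged
matrices are not necessarily the cross-validated values of Table 1.

```python
import numpy as np

# Connect-4 confusion matrices (zero diagonal), copied from Table 2.
ada_connect = np.array([[0.000, 0.000, 0.048],
                        [0.890, 0.000, 0.110],
                        [0.582, 0.000, 0.000]])
combo_connect = np.array([[0.000, 0.232, 0.137],
                          [0.181, 0.000, 0.212],
                          [0.079, 0.266, 0.000]])

def per_class_accuracy(C):
    """Accuracy on class l is 1 minus the sum of row l of the zero-diagonal matrix."""
    return 1.0 - C.sum(axis=1)

def operator_norm(C):
    """Largest singular value of C."""
    return np.linalg.norm(C, 2)

for name, C in [("AdaBoost.MM", ada_connect), ("CoMBo", combo_connect)]:
    print(name, np.round(per_class_accuracy(C), 3), round(float(operator_norm(C)), 3))
```

The accuracies of AdaBoost.MM range from 0 to more than 0.95 across classes, whereas those of CoMBo
are all close to each other, and the operator norm of the CoMBo matrix is the smaller of the two.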

In both datasets, the estimated real error is higher with CoMBo: as misclassified examples of the
majority classes become more numerous, they directly impact the overall error.

Such a behavior of CoMBo shows that it considers each class equally during the learning process.
These experiments confirm the smoothed learning performed by CoMBo over imbalanced classes.
The integration of cost-sensitive errors could then be easily performed during the minimization process
on ||C||.



6 Discussion 

The method proposed in this paper aims at minimizing the operator norm of the confusion matrix, which
is used as a performance measure instead of the classical misclassification error. In order to do so, we
proposed in Section 3 to minimize an upper bound of the norm of the empirical confusion matrix.
Admittedly, it is regrettable to rely on an upper bound, since what we wish to minimize is the norm
itself, in the same fashion as the COPA algorithm (Ralaivola, 2012). The first natural follow-up of this
work is to obtain a novel method which greedily minimizes the norm of the confusion matrix itself. We
think that using the same multi-class boosting framework as the one used here is the best way to tackle
this problem.

In Section 3.2 we mentioned that the confusion matrix used in this paper is just a particular case of
a larger family of confusion matrices. The matrices in this family are obtained by replacing the indicator
function in equations (4) and (5) by a loss function. Doing so allows us to consider special cases,
such as the fact that confusing class a with class b is worse than confusing a and c (think of automated
diagnosis systems). The second perspective of this work is thus to extend the results obtained here to
the more general family of loss-based confusion matrices. It could then easily integrate any prior on
cost-sensitive misclassifications, hopefully as a constant within the norm of the generalized confusion
matrix.

Last but not least, we would like to mention another possible extension of the confusion matrix as a
performance measure framework. Confusion matrices are generally defined for learning samples, but they
can also be defined for an ensemble of classifiers. Hence future work will also focus on using confusion
matrices to estimate the performance of ensembles of classifiers and, hopefully, on obtaining new
bounds for methods based on ensemble learning.



^1 Recall that the diagonal is set to zero because it should be as high as possible; thus it is not taken
into account during the norm minimization process.

7 Conclusion

We proposed a novel method based on a multi-class boosting framework that uses the operator norm of
the confusion matrix as an alternative performance measure, taking advantage of the richer information
embedded in this matrix. Our main contribution lies in the fact that, to the best of our knowledge, this is
the first boosting method based on the idea of minimizing the norm of the confusion matrix. We showed
in Section 3 how to obtain a loss function which upper-bounds the operator norm of the confusion matrix,
and how to obtain a boosting method that minimizes this loss function. Our method is given in Section
4, and we showed in the same section that the loss decreases with each round. Finally, the experimental
results given in Section 5 show that our method performs better than AdaBoost.MM when the norm of the
confusion matrix is considered as a performance measure. This method fares better than AdaBoost.MM
in the case of imbalanced samples.

References 

Bordes, A., Ertekin, S., Weston, J., and Bottou, L. Fast kernel classifiers with online and active
learning. J. Machine Learning Research, 6:1579-1619, 2005.

Chapelle, O. and Chang, Y. Yahoo! Learning to Rank Challenge overview. In JMLR Workshop and
Conference Proceedings.

Cortes, Corinna and Mohri, Mehryar. AUC optimization vs. error rate minimization. In Advances in
Neural Information Processing Systems (NIPS 2003), volume 16, Vancouver, Canada, 2004. MIT Press.

Estabrooks, Andrew, Jo, Taeho, and Japkowicz, Nathalie. A multiple resampling method for learning
from imbalanced data sets. Computational Intelligence, 20(1):18-36, 2004.

Fawcett, Tom. An introduction to ROC analysis. Pattern Recogn. Lett., 27(8):861-874, June 2006.

Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml

He, Haibo and Garcia, Edwardo A. Learning from imbalanced data. IEEE Trans. on Knowl. and Data
Eng., 21(9):1263-1284, September 2009.

Morvant, Emilie, Koço, Sokol, and Ralaivola, Liva. PAC-Bayesian Generalization Bound on Confusion
Matrix for Multi-Class Classification. In International Conference on Machine Learning, pp. 815-822,
2012.

Mukherjee, Indraneel and Schapire, Robert E. A theory of multiclass boosting. CoRR, abs/1108.2989,
2011.

Ralaivola, Liva. Confusion-based online learning and a passive-aggressive scheme. In Neural Information
Processing Systems Conference, 2012.

Tang, K., Wang, R., and Chen, T. Towards maximizing the area under the ROC curve for multi-class
classification problems. In Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2011),
7-11 August 2011.

Ting, K.M. A comparative study of cost-sensitive boosting algorithms. In Int'l Conf. Machine Learning,
pp. 983-990, 2000.

Wang, Huanjing, Khoshgoftaar, Taghi M., and Napolitano, Amri. Software measurement data reduction
using ensemble techniques. Neurocomputing, 92:124-132, 2012.

Yue, Yisong, Finley, Thomas, Radlinski, Filip, and Joachims, Thorsten. A support vector method for
optimizing average precision. In SIGIR, pp. 271-278, 2007.



Technical Report V 1.0 






